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Abstract 

We describe a model for capturing the statistical structure of local amplitude and 
local spatial phase in natural images. The model is based on a recently devel- 
oped, factorized third-order Boltzmann machine that was shown to be effective 
at capturing higher-order structure in images by modeling dependencies among 
squared filter outputs 1 1 1. Here, we extend this model to Lp -spherically symmet- 
ric subspaces. In order to model local amplitude and phase structure in images, we 
focus on the case of two dimensional subspaces, and the L2-norm. When trained 
on natural images the model learns subspaces resembling quadrature-pair Gabor 
filters. We then introduce an additional set of hidden units that model the depen- 
dencies among subspace phases. These hidden units form a combinatorial mixture 
of phase coupling distributions, concentrated in the sum and difference of phase 
pairs. When adapted to natural images, these distributions capture local spatial 
phase structure in natural images. 



1 Introduction 

In recent years a number of models have emerged for describing higher-order structure in images 
(i.e., beyond sparse, Gabor-like decompositions). These models utilize distributed representations of 
CO variance matrices to form an infinite or combinatorial mixture of Gaussians model of the data O 
[l |. These models have been shown to effectively capture the non- stationary variance structure of 
natural images. A variety of related models have focused on the local radial (in vectorized image 
space) structure of natural images 1 3] SI El El 7 1 . While these models represent a significant step 
forwards in modeling higher-order natural image structure, they only implicitly model local phase 
alignments across space and scale. Such local phase alignments are implicated as being a hallmark 
of edges, contours, and other shape structure in natural images 1 8 1. The model proposed in this paper 
attempts to extend these models to capture both amplitude and phase in natural images. 

In this paper, we first extend the recent factorized, third-order Boltzmann machine model of Ranzato 
& Hinton 1 1| to the case of Lp- spherically symmetric distributions. In order to directly model 
the dependencies among local amplitude and phase variables, we consider the restricted case of 
two-dimensional subspaces with L2-norm. When adapted to natural images, the subspace filters 
converge to quadrature-pair Gabor-like functions, similar to previous work |7|. The dependencies 
among amplitudes are modeled using a set of hidden units, similar to Ranzato & Hinton 1 1 1. Phase 
dependencies between subspaces are modeled using another set of hidden units as a mixture of 
phase coupling "covariance" matrices: conditioned on the hidden units, phases are modeled via a 
phase-coupled distribution ||9l . 
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1.1 Modeling Local Amplitude and Phase 



Our model may be viewed within the same framework as a number of recent models that attempt 
to capture higher-order structure in images by factorizing the coefficients of oriented, bandpass 
filters O m O O |71 [TOl [m . These models are currently the best probabilistic models of natural 
image structure: they produce state-of-the-art denoising |3 1, and achieve lower entropy encodings 
of natural images |5|. They can be viewed as sharing a common mathematical form in which the 
filter coefficients, x, are factored into a non-negative component z and a scalar component u, where 
X — zu. The non-negative factors, z, are modeled as either an independent set of variables each 
shared by a pair of linear components |7 1, a set of variables with learned dependencies to the linear 
components f6l, or as a single radial component (Hill. The scalar factor, u, is modeled in a number 
of ways: as an independent angular unit vector (SlIU, as a correlated noise process | 3 1, as the phase 
angle of paired filters (Tj, or as a sparse decomposition of latent variables 1 10 |. 

By separating the filter coefficients into two sets of variables, it is possible for higher levels of anal- 
ysis to model higher-order statistical structure that was previously entangled in the filter coefficients 
themselves. For example, the non-negative variables z are usually related to the local contrast or 
power within an image region, and Karklin & Lewicki have shown that it is possible to train a sec- 
ond layer to learn the structure in these variables via sparse coding 1 10 |. Similarly, Lyu & Simoncelli 
learn an MRF model on these variables and show that the resulting model achieves state-of-the-art 
denoising |3|. It is generally less clear what structure is represented in the scalar variables u. Here 
we take inspiration from Zetzsche's observations regarding the circular symmetry of joint distribu- 
tions of related filter coefficients and conjecture that this quantity should be modeled as the sine or 
cosine of an underlying phase variable, i.e., u = sm{0) or u = cos{0) |[T2l . 

One of the main contributions of this paper is a model of the joint structure of local phase in natural 
images. For the case of phases in a complex pyramid, the empirical marginal distribution of phases 
is always uniform across a corpus of natural images (data not shown); however, the empirical distri- 
bution of pairwise phase differences is often concentrated around a preferred phase. We can write 
the joint distribution of a pair of phase variables with a dependency in the difference of the phase 
variables as a von Mise^ distribution in the difference of the phases: 

p{0i,0j) = exp(/^cos(i9i -Oj - /i)) (1) 

This distribution is parameterized by a concentration k, and a phase offset /i. Using trigonometric 
identities we can re-express the cosine of the difference as a sum of bivariate terms of the form 
cos{6i) sm{6j). One may view these terms as the pairwise statistics for angular variables, just as 
the the bivariate terms in the covariance matrix are the pairwise statistics for a Gaussian. This logic 
extends to multivariate distributions with n > 2. 

In the next section we describe an extension to the mean and covariance Restricted Boltzmann 
Machine (mcRBM) of Ranzato & Hinton Q. Our extension represents local amplitude and phase in 
its factors. We model the amplitude dependencies with a set of hidden variables, and we model the 
phase dependencies among factors as a combinatorial mixture of phase-coupling distributions I9l. 
This model is shown schematically in Fig.[T] 

2 Model 

We first review the factorized third-order Boltzmann machine of Ranzato and Hinton |T| named 
mcRBM because it models both the mean and covariance structure of the data. We then describe 
an extension that models pairs of factors as two dimensional subspaces, which we call the mpRBM. 
The mpRBM provides a phase angle between pairs. The joint statistics between phase angles are 
not explicitly modeled by the mpRBM. Thus we propose additional hidden factors that model the 
pair- wise phase dependencies as a product of phase coupling distributions. For convenience we 
name this model mpkRBM, where k references the phase coupling matrix K that is generated by 
conditioning on the phase-coupling hidden units. 

^ A von Mises distribution for an angular variable with concentration parameter, k, and mean, /x, is defined 
as := 27r/o(fc) where Io{k) is the zeroth order modified Bessel function. 
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Figure 1 : The mpkRBM model is shown schematically. The visible units Vi are connected to a set 
of factors through the matrix Cifi. These factors come in pairs, as indicated by the horizontal line 
linking neighboring triangles. The squared output of each factor connects to the subspace coupling 
hidden units through the matrix Pfn . Another set of factors is connected to the sine and cosine 
representations of each factor (second layer triangles) through the matrix Qfig. This second set of 
factors connects to the phase coupling hidden units h^. Single-stroke vertical lines indicate linear 
interactions, while double stroke lines indicate quadratic interactions. Note that the mean factors, 
h^, and biases are omitted for clarity. See text for further explanation. 
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2.1 mcRBM 



The mcRBM defines a probabiHty distribution by specifying an energy function E. The probabiHty 
distribution is then given as ex exp(— where v G is the vectorized input image patch 
with D pixels. The mcRBM has two major parts of the energy, the covariance contributions, E^, 
and the mean contributions, E^ . 

The covariance terms are defined as: 

N F D N 

E>, = "2 E '^n E ^/"(E ^^fM^' - E ^nh'n (2) 

n=l /=1 ^=1 n=l 

where P G R^^^ is a matrix with positive entries, is the number of hidden units and is a 
vector of biases. The columns of the matrix C G R^^^ are the image domain filters and their 
squared outputs are weighted by each row in P. The P matrix can be considered as a weighting on 
the squared outputs of the image filters. We normalize the visible units (| | v| |l2 = 1) following the 
procedure in 1 1 1. Each term in the first sum includes two visible units and one hidden unit and is 
a third-order Boltzmann machine. However, the third-order interactions are restricted by the form 
of the model into factors. Each factor, (XliLi C^if^i)'^ is a deterministic mapping from the image 
domain. The hidden units combine combinatorially to produce a zero-mean Gaussian distribution 
with an inverse covariance matrix that is a function of the hidden units: 

E-^ =Cdiag(P/i")C' (3) 

Because the representation in the hidden units is distributed, the model can describe a combinatorial 
number of covariance matrices. 

The mean contribution to the energy, E^ is given as: 

M D M 

with M binary hidden units that connect directly to the visible units through the matrix Wij. 
The terms are the mean hidden biases. The form of the mean contribution is a standard RBM 
|[T3l . Note that the conditional over both sets of hidden units is factorial. 

The conditional distribution over the visible units given the hidden units is a Gaussian distribution, 
which is a function of the hidden variable states: 

M 

p{v\h', h^) ^ 7V(E(^ W,,hJ^), E) (5) 
i=i 

where E is given as in Eq.[3] The mean of the specified Gaussian is a function of both the mean 
and covariance h"^ hidden units. 



The total energy is given by: 

D D 

E{v, h', /i-) = E'{v, h') + ^-(^, /^"^) + - E - E 



with the last two terms a penalty on the variance of the visible units introduced because E^ is 
invariant to the norm of the visible units and biases h^, on the visible units. 



2.2 mpRBM 

A number of recent results indicate that the local structure of image patches is well modeled by Lp- 
spherically symmetric subspaces |6|. To produce Lp- spherically symmetric subspaces we impose a 
pairing of factors into an Lp subspace. The covariance energy term in the mcRBM is thus altered to 
give: 

N F 



D 



N 

E 



(7) 
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Now the tensor C^// is a set of filters for each factor, / spanning the L dimensional subspace over 
the index I. The distribution over the visible conditioned on the hidden units can be expressed as 
a mixture of Lp distributions. Note that the hidden units remain independent conditioned on the 
visible units. 

The optimal choice of L and a is an interesting project related to recent models [[T4l but is beyond 
the scope of this paper. Here, we have chosen to focus on modeling the structure in the space 
complementary to the norms of the subspaces. To achieve a tractable form of the subspace structure 
we select the special case of L = 2 and a = 2. The choice of a = 2 is motivated by subspace-ICA 
models |6| and sparse coding with complex basis functions |7 1 where the amplitude within each 
complex basis function subspace is modeled as a sparse component. 

2.3 mpkRBM 

While the formulation of -spherically symmetric subspace models the spherically symmetric dis- 
tributions of natural images, there are likely to be residual dependencies between the subspaces in 
the non-radial directions. For example, elongated edge structure will produce dependencies in the 
phase alignments across space and through spatial scale |8|. Such dependencies are not captured, or 
are at least only implicitly captured in the mpRBM. By formulating the mpRBM with L = 2 and 
a = 2 we can define a phase angle within each subspace. The dependencies between these phase 
angles will capture image structure such as phase alignments due to edges. We define a new vari- 
able, Xfi, which is a deterministic function of the visible units: Xfi = cos{Of) and Xf2 = sin(6>/). 



where Of = arg f {^^^^ CifiVi) + j(X]i=i Cif2Vd)), and j is the imaginary unit and arg(.) is the 



complex argument or phase. 

We now use a mathematical form that is similar to the covariance model contribution in the mcRBM 
to model the joint distribution of phases. We define the energy of the phase coupling contribution, 
denoted E^, sls, 



with T binary hidden units that modulate the columns of the matrix R G R^^^. The rows of R 
then modulate the squared projections of the vector x through the matrix Q G R^^^^^. The term 
is a vector of biases for the hidden units. Similar to the h"^ terms in the mcRBM, the hk units 
in the mpkRBM contribute pair- wise dependencies in the sine-cosine space of the phases. Pair- wise 
dependencies in the sine-cosine space can be re-expressed using trigonometric identities as terms 
in the sums and differences of the phase pairs, identical to the phase coupling described in Eq. [T] 
Such explicit dependencies may be important to model because edges in images exhibit structured 
dependencies in the differences of local spatial phase. 

Because the phase coupling energy is additive in each term the hidden unit distribution condi- 
tioned on the hidden units is factorial. The probability of a given is given as: 



where the sigmoid, or logistic, function is a{y) = (1 + exp(— ?/)) ^. 

We can see the dependency structure imposed by the units by considering the conditional distri- 
bution in the space of the phases, 6\ 



Therefore, the units provide a combinatorial code of phase-coupling dependencies. The number 
of phase-coupling matrices that the model can generate is exponential in the number of hidden 
units because the hidden unit representation is binary and combinatorial. Again, instead of allowing 
arbitrary three way interactions between the x variables and the hidden units, we have chosen a 




(8) 




(9) 



K = Q di8ig{Rh^)Q' 
p{0\h^) (X exp{-^xKx) 



(10) 
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specific factorization where the squared factors are {^i^i Xl/=i Qfig^fiY- Because there are no 
direct interactions between the hidden units, h^, the model still has the form of a conventional 
Restricted Boltzmann Machine. We call this model a mpkRBM because it builds upon the mpRBM 
and the k references the coupling matrix in the pair- wise phase distribution produced by conditioning 
on . 

Combining the three types of hidden units, h^,h^, and , allows each type of hidden unit to model 
structure captured by the corresponding functional form. For example, the hidden units will 
generate phase dependencies implicitly through their activations. However, if the phase structure of 
the data contains additional structure not captured implicitly by the and hidden units, there 
will be a learning signal for the units. Conversely, the phase statistics that are produced implicitly 
by the and units will be ignored by the terms because the learning signal is driven by the 
differences in the data and model distributions. 

3 Learning 

We learn the parameters of the model by stochastic gradient ascent of the log-likelihood. We express 
the likelihood in terms of the energy with the hidden units integrated out (omitting the visible squared 
term and biases): 



N ^ F 



V L D 1 V« 

F{v)= - ^log(l + exp(-^P^. E(E^^/^^r +^n)) (11) 

n=l ^ /=1 L ^ i=l II II . 

T G L F 

t=l g^l 1=1 f=l 

M D 

- ^log(l + exp(^iy,,^, + 6-)) 

j=i i=i 

It is not possible to efficiently sample the distribution over the visible units conditioned on the 
hidden units exactly (in contrast, sampling from the visible units conditioned on the hidden units in 
a standard REM is efficient and exact). We choose to integrate out the hidden variables, instead of 
taking the conditional distribution, to achieve better estimates of the model statistics. 

Maximizing the log-likelihood the gradient update for the model parameters (denoted as 6 G 

{R, Q, 6^, C, P, 6^ W, 6^, 6^} is given as: 

— \ ^r^/ model \^^)data (tZj 

where indicates the expectation taken over the distribution p . Calculating the expectation over 
the data distribution is straightforward. However, calculating the expectation over the model distri- 
bution requires computationally expensive sampling from the equilibrium distribution. Therefore, 
we use standard techniques to approximate the expectation of the gradients under the model distri- 
bution following the procedure in [15, 1|. To summarize, in Contrastive Divergence learning |13J 
the model distribution is approximated by running a dynamic sampler starting at the data for only 
one step. Given the energy function with the hidden units integrated out, we run hybrid Monte Carlo 
sampling 1 16| starting at the data for one dynamical simulation to produce an approximate sample 
from the model distribution. For each dynamical simulation we draw a random momentum and run 
20 leap-frog steps while adapting the step size to achieve a rejection rate of about 10%. 



3.1 Learning parameters 

We trained the models on image patches selected randomly from the Berkeley Segmentation 
Database. We subtracted the image mean, and whitened 16x16 color image patches preserving 
99% of the image variance. This resulted in D = 138 visible units. We examined a model with 256 
h"^ covariance units, 256 phase-coupling units, and 100 h'^ mean units. We initialized the values 
of the matrix C to random values and normalized each image domain vector to have unit length. 
We initialized the matrices W and Q to small random values with variances equal to 0.05, and 0.1 
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Figure 2: Learned Lp Weights, C: A) and B) show the individual components of each filter pair 
Cifi and Ci/2, respectively. Each subimage shows the image domain weights in the un whitened 

color image space. C) shows the amplitude ^J~{cf^^[^r~cf^ as a function of space, and D) shows 

the phase arg(Ci/i + as a function of space. Each panel preserves the ordering such that the 

image in position (1,1) of A) corresponds to the same subspace of C as the image in position (1,1) 
ofB), C),and D). 
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Figure 3: Learned Mean Weights, W: Each subimage shows a column of W in the un whitened color 
image domain. These functions resemble those found by the mcRBM. 



respectively. We initialized the biases, b^,b^,b^, and 6^ to 2.0, -2.0, 0.0, and 0.0 respectively. The 
learning rates for R, Q, P, C, W, bq, be, bm, bv, were set to 0.0015, 0.1, 0.0015, 0.15, 0.015, 
0.0005, 0.0015, 0.0075, and 0.0015, respectively. 

After each learning update we normalized the lengths of the C vectors to have the average of the 
lengths. This allowed the lengths of the C vectors to grow or shrink to match the data distribution, 
but prevented any relative scaling between the subspaces. After each update we also set any positive 
values in P to zero and normalized the columns to have unit L2-norm. Finally, we normalized the 
lengths of the columns of R to have unit L2-norm. We learned on mini-batches of 128 image patches 
and learned the various parts of the model sequentially. We adapted the parameters of a mpRBM 
model with L = 2 and a = 2 and fixed the matrix P to the negative identity for 10,000 iterations. 
We then adapted the parameters, including P, for another 30,000 iterations. We then added the 
units to this learned model and adapted the values in Q for 20,000 iterations while holding the 
matrix R fixed to the identity. Next we adapted R for 20,000 iterations. Finally, we allowed all of 
the parameters in the model to adapt for 40,000 iterations. 

4 Experiments 

4.1 Learning on Natural Images 

Here we examine the structure represented in the model parameters R, Q, P, and C after training 
the mpkRBM on natural images. The subspace filters in the C learn localized oriented band-pass 
filters roughly in quadrature, see Fig. [2] We have observed that the filters in the matrix C appear 
to learn more textured patterns than those in the mcRBM, but a more rigorous analysis is needed 
to verify such an observation. The weights in the matrix P adapt to group subspaces with similar 
spatial position and spatial frequency. See Fig.|4]for a depiction of the image filters with the highest 
weights to each hidden unit h^. The values in the learned matrix W are similar to those learned by 
the mcRBM and are shown in Fig.|3] 

The learned R and Q weights are harder to visualize as they express dependencies in a layer removed 
from the image domain. However, we can view the subspaces that are weighted highest by each 
column of Q. For each column in Fig. [5] we depict the image domain filters (C) that are weighted 
highest by the corresponding column in Q. We similarly show the image domain filters that are 
weighted highest by each column in R in Fig. [6] 

5 Discussion 

The mpRBM and mpkRBM suggest a number of interesting future directions. For example, it should 
be possible to learn the dimensionality of the subspaces by introducing a weighting matrix in the L 
dimensional space of the tensor C. However, it is not clear how to define an appropriate angle within 
these subspaces for the phase-coupling factors. Although, it would be reasonable to learn a separate 
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Figure 4: Learned Lp groupings, P: A random selection (out of 256) columns in P. Each column 
depicts the top 6 weighted subspaces in C for a specific column in P. Each subspace in C is two- 
dimensional and we show the un whitened image domain weights for both subspaces in A) and B). 
The corresponding image domain amplitudes and phases for the subspaces are shown in C) and D). 
There is clear grouping of subspaces with similar positions, orientations, and scales. 
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Figure 5: Learned Phase Projections, Q: The first 32 (out of 256) columns in Q are shown. The 
entries in each column of Q weight the cosine or sine of each subspace. Because the cosine and sine 
correspond to specific vectors in the tensor C, we show the image domain projection of these vectors 
that take the highest weight in the column of Q. In this figure, the 6 image domain projections with 
the highest magnitude weights are shown in the rows for different columns of Q (each is shown in a 
different column of the figure). 
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Figure 6: Learned Phase Coupling, R\ The first 32 (out of 256) columns in R. Each column in R 
produces a different coupling matrix (see Eq.[TO]). The values in this matrix indicate phase coupling 
between pairs of subspaces. Therefore, for the matrix K produced by a specific column of R, we 
find the couplings with the highest magnitude. Given these sorted pairs, we take the unique top 6 
entries. Finally, we plot the image domain filters corresponding to these 6 entries (shown in the 
rows). As in Fig. |5] these entries can be mapped to vectors in the tensor C and thus plotted in the 
image domain. Different columns of R are shown in different columns of the figure. 

set of factors, C, for the units and the units, thus freeing the constraint of L = 2 for the 
units. It may also be possible to extend the mpRBM to a nested form of Lp distributions suggested 
in |[T4l . It is also worth exploring the behavior of the phase-coupling hidden units as the number of 
hidden units is varied. As we had little prior expectation as to the structure of the learned R and Q 
matrices, the choice of 100 hidden units was rather arbitrary. 

6 Conclusions 

In this paper we have introduced two new factorized Boltzmann machines: the mpRBM and the mp- 
kRBM, which each extend the factorized third-order Boltzmann machine (the mcRBM) of Ranzato 
and Hinton 1 1 1. The form of these additional hidden unit factors are motivated by image models of 
subspace structure (61 and phase alignments due to edges in natural images O. Focusing on the 
mpkRBM, we have shown that such a model learns phase structure in natural images. 
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